EDA Practice

This is a simplified version of the EDA Practice notebook.

In [ ]:
import pandas as pd 
import numpy as np

# Create a simple DataFrame
df = pd.DataFrame({
    'A': np.random.rand(5),
    'B': np.random.rand(5)
})

print(df.head())

LOAD DATA

In [68]:
df = pd.read_csv("http://www.ishelp.info/data/insurance.csv")

SUMMARY_OF_DATA

In [70]:
from autoviz.AutoViz_Class import AutoViz_Class
AV = AutoViz_Class()
df_av = AV.AutoViz('http://www.ishelp.info/data/insurance.csv')
Imported v0.1.58. After importing, execute '%matplotlib inline' to display charts in Jupyter.
    AV = AutoViz_Class()
    dfte = AV.AutoViz(filename, sep=',', depVar='', dfte=None, header=0, verbose=1, lowess=False,
               chart_format='svg',max_rows_analyzed=150000,max_cols_analyzed=30, save_plot_dir=None)
Update: verbose=0 displays charts in your local Jupyter notebook.
        verbose=1 additionally provides EDA data cleaning suggestions. It also displays charts.
        verbose=2 does not display charts but saves them in AutoViz_Plots folder in local machine.
        chart_format='bokeh' displays charts in your local Jupyter notebook.
        chart_format='server' displays charts in your browser: one tab for each chart type
        chart_format='html' silently saves interactive HTML files in your local machine
Shape of your Data Set loaded: (1338, 7)
#######################################################################################
######################## C L A S S I F Y I N G  V A R I A B L E S  ####################
#######################################################################################
Classifying variables in data set...
Data cleaning improvement suggestions. Complete them before proceeding to ML modeling.
          Nuniques  dtype    Nulls  Nullpercent  NuniquePercent  Value counts Min  Data cleaning improvement suggestions
charges   1337      float64  0      0.000000     99.925262       0                 skewed: cap or drop outliers
bmi       548       float64  0      0.000000     40.956652       0
age       47        int64    0      0.000000     3.512706        0
children  6         int64    0      0.000000     0.448430        0
region    4         object   0      0.000000     0.298954        324
sex       2         object   0      0.000000     0.149477        662
smoker    2         object   0      0.000000     0.149477        274
    7 Predictors classified...
        No variables removed since no ID or low-information variables found in data set
Number of All Scatter Plots = 3
All Plots done
Time to run AutoViz = 84 seconds 

 ###################### AUTO VISUALIZATION Completed ########################
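AutoViz flags charges as skewed and suggests capping or dropping outliers. The following is a minimal sketch of one common way to cap, assuming the 1.5 x IQR rule (the notebook itself does not pick a rule). The values are the charges from the rows printed elsewhere in this notebook plus the maximum reported by df.describe(); the names charges and capped are illustrative.

```python
import pandas as pd

# charges values copied from the rows shown in this notebook,
# plus the maximum reported by df.describe()
charges = pd.Series(
    [16884.924, 1725.5523, 4449.462, 21984.47061, 3866.8552,
     10600.5483, 2205.9808, 1629.8335, 2007.945, 29141.3603,
     63770.42801],
    name="charges",
)

# Assumption: treat anything beyond 1.5 * IQR from the quartiles as an outlier
q1, q3 = charges.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# clip() caps values at the bounds instead of dropping the rows
capped = charges.clip(lower=lower, upper=upper)

print(f"skew before: {charges.skew():.2f}, after: {capped.skew():.2f}")
print(f"max before: {charges.max():.2f}, after: {capped.max():.2f}")
```

Capping keeps every row, which matters when the dataset is small; dropping the rows instead would be `charges[charges.between(lower, upper)]`.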
In [71]:
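import pandas_profiling as pp  # assumption: `pp` refers to pandas_profiling (published as ydata_profiling in newer releases)
pp.ProfileReport(df)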

In [ ]:
df.head()
In [5]:
df.describe()
Out[5]:
               age          bmi     children       charges
count  1338.000000  1338.000000  1338.000000   1338.000000
mean     39.207025    30.663397     1.094918  13270.422265
std      14.049960     6.098187     1.205493  12110.011237
min      18.000000    15.960000     0.000000   1121.873900
25%      27.000000    26.296250     0.000000   4740.287150
50%      39.000000    30.400000     1.000000   9382.033000
75%      51.000000    34.693750     2.000000  16639.912515
max      64.000000    53.130000     5.000000  63770.428010
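By default, df.describe() summarizes only the numeric columns, which is why sex, smoker, and region are missing from the table above. A small sketch of how the non-numeric columns could be summarized as well; the sample frame holds the first five rows shown in this notebook, and the name sample is illustrative.

```python
import pandas as pd

# First five rows of the insurance data as printed in this notebook
sample = pd.DataFrame({
    "age": [19, 18, 28, 33, 32],
    "sex": ["female", "male", "male", "male", "male"],
    "bmi": [27.9, 33.77, 33.0, 22.705, 28.88],
    "smoker": ["yes", "no", "no", "no", "no"],
    "region": ["southwest", "southeast", "southeast", "northwest", "northwest"],
})

# Excluding numeric dtypes leaves the text columns, which get
# count / unique / top / freq instead of mean and quartiles
summary = sample.describe(exclude=["number"])
print(summary)
```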
In [6]:
df.shape
Out[6]:
(1338, 7)
In [45]:
df.columns
Out[45]:
Index(['age', 'sex', 'bmi', 'children', 'smoker', 'region', 'charges'], dtype='object')
In [46]:
print (f'age: {df.age.count()}')
print (f'sex: {df.sex.count()}')
print (f'bmi: {df.bmi.count()}')
print (f'children: {df.children.count()}')
print (f'smoker: {df.smoker.count()}')
print (f'region: {df.region.count()}')
print (f'charges: {df.charges.count()}')
age: 1338
sex: 1338
bmi: 1338
children: 1338
smoker: 1338
region: 1338
charges: 1338

unique_values

In [47]:
print (f'age: {df.age.nunique()}')
print (f'sex: {df.sex.nunique()}')
print (f'bmi: {df.bmi.nunique()}')
print (f'children: {df.children.nunique()}')
print (f'smoker: {df.smoker.nunique()}')
print (f'region: {df.region.nunique()}')
print (f'charges: {df.charges.nunique()}')
age: 47
sex: 2
bmi: 548
children: 6
smoker: 2
region: 4
charges: 1337

typing

In [48]:
print (f'age : {df.age.dtype}')
print (f'sex: {df.sex.dtype}')
print (f'bmi: {df.bmi.dtype}')
print (f'children: {df.children.dtype}')
print (f'smoker: {df.smoker.dtype}')
print (f'region: {df.region.dtype}')
print (f'charges: {df.charges.dtype}')
age : int64
sex: object
bmi: float64
children: int64
smoker: object
region: object
charges: float64

isnull

In [50]:
print (f'age : {df.age.isnull().sum()}')
print (f'sex: {df.sex.isnull().sum()}')
print (f'bmi: {df.bmi.isnull().sum()}')
print (f'children: {df.children.isnull().sum()}')
print (f'smoker: {df.smoker.isnull().sum()}')
print (f'region: {df.region.isnull().sum()}')
print (f'charges: {df.charges.isnull().sum()}')
age : 0
sex: 0
bmi: 0
children: 0
smoker: 0
region: 0
charges: 0
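The per-column print statements in the last four cells can be collapsed into one table. This is a sketch only, built on the first five rows shown in this notebook; the names sample and summary are illustrative, and on the full dataset the same four calls would be made on df.

```python
import pandas as pd

# First five rows of the insurance data as printed in this notebook
sample = pd.DataFrame({
    "age": [19, 18, 28, 33, 32],
    "sex": ["female", "male", "male", "male", "male"],
    "bmi": [27.9, 33.77, 33.0, 22.705, 28.88],
    "children": [0, 1, 3, 0, 0],
    "smoker": ["yes", "no", "no", "no", "no"],
    "region": ["southwest", "southeast", "southeast", "northwest", "northwest"],
    "charges": [16884.924, 1725.5523, 4449.462, 21984.47061, 3866.8552],
})

# Each call returns one value per column, so they line up as table columns
summary = pd.DataFrame({
    "count": sample.count(),          # non-null values
    "nunique": sample.nunique(),      # distinct values
    "dtype": sample.dtypes.astype(str),
    "nulls": sample.isnull().sum(),   # missing values
})
print(summary)
```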
In [55]:
# One call reports the null count for every column; looping over the
# columns only printed this same summary once per column.
print(df.isnull().sum())
age         0
sex         0
bmi         0
children    0
smoker      0
region      0
charges     0
dtype: int64
In [59]:
df.values
Out[59]:
array([[19, 'female', 27.9, ..., 'yes', 'southwest', 16884.924],
       [18, 'male', 33.77, ..., 'no', 'southeast', 1725.5523],
       [28, 'male', 33.0, ..., 'no', 'southeast', 4449.462],
       ...,
       [18, 'female', 36.85, ..., 'no', 'southeast', 1629.8335],
       [21, 'female', 25.8, ..., 'no', 'southwest', 2007.945],
       [61, 'female', 29.07, ..., 'yes', 'northwest', 29141.3603]],
      dtype=object)

LABEL_ENCODER

In [61]:
#replaced_base['Ward']=dropped_base['Ward'].replace(['BMT1', 'BMT2', "BMT3","BMT4"],
                #[0, 1, 2 , 3], inplace=False)
In [76]:
from sklearn import preprocessing 
le = preprocessing.LabelEncoder()
df["sex"] = le.fit_transform(df["sex"])
In [77]:
df
Out[77]:
      age  sex     bmi  children  smoker     region      charges
0      19    0  27.900         0     yes  southwest  16884.92400
1      18    1  33.770         1      no  southeast   1725.55230
2      28    1  33.000         3      no  southeast   4449.46200
3      33    1  22.705         0      no  northwest  21984.47061
4      32    1  28.880         0      no  northwest   3866.85520
...   ...  ...     ...       ...     ...        ...          ...
1333   50    1  30.970         3      no  northwest  10600.54830
1334   18    0  31.920         0      no  northeast   2205.98080
1335   18    0  36.850         0      no  southeast   1629.83350
1336   21    0  25.800         0      no  southwest   2007.94500
1337   61    0  29.070         0     yes  northwest  29141.36030

1338 rows × 7 columns

In [78]:
for col in df.columns:
    if df.dtypes[col] == 'object':
        print(col)
        #base[col] = le.fit_transform(base[col])
smoker
region
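The cell above only lists the remaining text columns; the commented-out line hints at encoding each of them. A minimal sketch of that step, using pandas categorical codes as a stand-in for LabelEncoder (an assumption, not the notebook's method). Both assign integers to the labels in sorted order. The sample rows come from the table above, and the names sample and encoded are illustrative.

```python
import pandas as pd

# First five rows as shown above, with sex already label-encoded
sample = pd.DataFrame({
    "age": [19, 18, 28, 33, 32],
    "sex": [0, 1, 1, 1, 1],
    "smoker": ["yes", "no", "no", "no", "no"],
    "region": ["southwest", "southeast", "southeast", "northwest", "northwest"],
})

encoded = sample.copy()
# Every non-numeric column is converted to integer category codes;
# categories are sorted, so "no" -> 0 and "yes" -> 1
for col in encoded.select_dtypes(exclude="number").columns:
    encoded[col] = encoded[col].astype("category").cat.codes

print(encoded)
```

Integer codes impose an arbitrary order on region, so for models that read the numbers as magnitudes, one-hot encoding with pd.get_dummies may be the safer choice.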
In [79]:
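# Keep only columns with at least 200 non-null values.
# `how` and `thresh` are alternatives; recent pandas versions raise an error if both are passed.
dropped_df = df.dropna(axis=1, thresh=200)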
dropped_df
Out[79]:
      age  sex     bmi  children  smoker     region      charges
0      19    0  27.900         0     yes  southwest  16884.92400
1      18    1  33.770         1      no  southeast   1725.55230
2      28    1  33.000         3      no  southeast   4449.46200
3      33    1  22.705         0      no  northwest  21984.47061
4      32    1  28.880         0      no  northwest   3866.85520
...   ...  ...     ...       ...     ...        ...          ...
1333   50    1  30.970         3      no  northwest  10600.54830
1334   18    0  31.920         0      no  northeast   2205.98080
1335   18    0  36.850         0      no  southeast   1629.83350
1336   21    0  25.800         0      no  southwest   2007.94500
1337   61    0  29.070         0     yes  northwest  29141.36030

1338 rows × 7 columns
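All seven columns survive the dropna call because each holds 1338 non-null values, well above the threshold of 200. The toy frame below (hypothetical data, not from the insurance set) sketches how thresh and how differ when columns do contain missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: one complete column, one mostly empty, one fully empty
demo = pd.DataFrame({
    "full": [1.0, 2.0, 3.0, 4.0],
    "sparse": [1.0, np.nan, np.nan, np.nan],
    "empty": [np.nan] * 4,
})

# thresh=2: keep columns that have at least 2 non-null values
kept_thresh = demo.dropna(axis=1, thresh=2)

# how="all": drop only columns in which every value is null
kept_all = demo.dropna(axis=1, how="all")

print(list(kept_thresh.columns))
print(list(kept_all.columns))
```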

Importing Needed packages

In [8]:
import matplotlib.pyplot as plt
import pandas as pd
import pylab as pl
import numpy as np
%matplotlib inline

Downloading Data

In [9]:
!wget -O FuelConsumption.csv https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-ML0101EN-SkillsNetwork/labs/Module%202/data/FuelConsumptionCo2.csv
--2023-02-26 13:00:04--  https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-ML0101EN-SkillsNetwork/labs/Module%202/data/FuelConsumptionCo2.csv
Connecting to 127.0.0.1:10077... connected.
Proxy request sent, awaiting response... 200 OK
Length: 72629 (71K) [text/csv]
Saving to: ‘FuelConsumption.csv’

FuelConsumption.csv 100%[===================>]  70.93K  57.7KB/s    in 1.2s    

2023-02-26 13:00:10 (57.7 KB/s) - ‘FuelConsumption.csv’ saved [72629/72629]

Reading the data in

In [4]:
df = pd.read_csv("FuelConsumption.csv")
In [5]:
# take a look at the dataset
df.head()
Out[5]:
   MODELYEAR   MAKE       MODEL  VEHICLECLASS  ENGINESIZE  CYLINDERS  TRANSMISSION  FUELTYPE  FUELCONSUMPTION_CITY  FUELCONSUMPTION_HWY  FUELCONSUMPTION_COMB  FUELCONSUMPTION_COMB_MPG  CO2EMISSIONS
0       2014  ACURA         ILX       COMPACT         2.0          4           AS5         Z                   9.9                  6.7                   8.5                        33           196
1       2014  ACURA         ILX       COMPACT         2.4          4            M6         Z                  11.2                  7.7                   9.6                        29           221
2       2014  ACURA  ILX HYBRID       COMPACT         1.5          4           AV7         Z                   6.0                  5.8                   5.9                        48           136
3       2014  ACURA     MDX 4WD   SUV - SMALL         3.5          6           AS6         Z                  12.7                  9.1                  11.1                        25           255
4       2014  ACURA     RDX AWD   SUV - SMALL         3.5          6           AS6         Z                  12.1                  8.7                  10.6                        27           244

Data Exploration

Let's first explore the data with descriptive statistics.

In [10]:
# summarize the data
df.describe()
Out[10]:
       MODELYEAR   ENGINESIZE    CYLINDERS  FUELCONSUMPTION_CITY  FUELCONSUMPTION_HWY  FUELCONSUMPTION_COMB  FUELCONSUMPTION_COMB_MPG  CO2EMISSIONS
count     1067.0  1067.000000  1067.000000           1067.000000          1067.000000           1067.000000               1067.000000   1067.000000
mean      2014.0     3.346298     5.794752             13.296532             9.474602             11.580881                 26.441425    256.228679
std          0.0     1.415895     1.797447              4.101253             2.794510              3.485595                  7.468702     63.372304
min       2014.0     1.000000     3.000000              4.600000             4.900000              4.700000                 11.000000    108.000000
25%       2014.0     2.000000     4.000000             10.250000             7.500000              9.000000                 21.000000    207.000000
50%       2014.0     3.400000     6.000000             12.600000             8.800000             10.900000                 26.000000    251.000000
75%       2014.0     4.300000     8.000000             15.550000            10.850000             13.350000                 31.000000    294.000000
max       2014.0     8.400000    12.000000             30.200000            20.500000             25.800000                 60.000000    488.000000

Let's select some features to explore more.

In [12]:
cdf = df[['ENGINESIZE','CYLINDERS','FUELCONSUMPTION_COMB','CO2EMISSIONS']]
cdf.head(9)
Out[12]:
   ENGINESIZE  CYLINDERS  FUELCONSUMPTION_COMB  CO2EMISSIONS
0         2.0          4                   8.5           196
1         2.4          4                   9.6           221
2         1.5          4                   5.9           136
3         3.5          6                  11.1           255
4         3.5          6                  10.6           244
5         3.5          6                  10.0           230
6         3.5          6                  10.1           232
7         3.7          6                  11.1           255
8         3.7          6                  11.6           267

We can plot each of these features:

In [13]:
viz = cdf[['CYLINDERS','ENGINESIZE','CO2EMISSIONS','FUELCONSUMPTION_COMB']]
viz.hist()
plt.show()
[Figure: histograms of CYLINDERS, ENGINESIZE, CO2EMISSIONS, and FUELCONSUMPTION_COMB]

Now, let's plot one of these features, FUELCONSUMPTION_COMB, against CO2EMISSIONS to see how linear the relationship is:

In [15]:
plt.scatter(cdf.FUELCONSUMPTION_COMB, cdf.CO2EMISSIONS,  color='blue')
plt.xlabel("FUELCONSUMPTION_COMB")
plt.ylabel("Emission")
plt.show()
[Figure: scatter plot of CO2EMISSIONS (y-axis, labeled "Emission") against FUELCONSUMPTION_COMB (x-axis)]
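A scatter plot shows linearity visually; the Pearson correlation coefficient puts a number on it. The sketch below uses only the nine rows printed by cdf.head(9) earlier, so the figures it produces describe that small sample and not the full 1067-row dataset. The names cdf_head and corr_with_co2 are illustrative.

```python
import pandas as pd

# The nine rows printed by cdf.head(9) in this notebook
cdf_head = pd.DataFrame({
    "ENGINESIZE": [2.0, 2.4, 1.5, 3.5, 3.5, 3.5, 3.5, 3.7, 3.7],
    "CYLINDERS": [4, 4, 4, 6, 6, 6, 6, 6, 6],
    "FUELCONSUMPTION_COMB": [8.5, 9.6, 5.9, 11.1, 10.6, 10.0, 10.1, 11.1, 11.6],
    "CO2EMISSIONS": [196, 221, 136, 255, 244, 230, 232, 255, 267],
})

# corr() defaults to Pearson; keep each feature's correlation with the target
corr_with_co2 = cdf_head.corr()["CO2EMISSIONS"].drop("CO2EMISSIONS")
print(corr_with_co2.sort_values(ascending=False))
```

A coefficient near 1 or -1 suggests a straight-line fit is reasonable; on the full data the same call would be `cdf.corr()`.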

Minimal EDA

Minimal EDA Notebook

This is a simplified version for testing.

In [ ]:
import pandas as pd
import numpy as np

# Create a simple DataFrame
df = pd.DataFrame({
    'A': np.random.rand(5),
    'B': np.random.rand(5)
})

print(df)

test notebook

This is a test

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
Cell In[1], line 1
----> 1 import numpy as np
      2 import pandas as pd
      3 import seaborn as sns

ModuleNotFoundError: No module named 'numpy'
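The traceback means numpy is not installed in the environment the kernel is running in. A small sketch, using only the standard library, of how one might check which of the imported packages are missing before running the rest of the notebook; the list name required is illustrative.

```python
import importlib.util
import sys

# Packages the cell above tries to import
required = ["numpy", "pandas", "seaborn"]

# find_spec returns None when a top-level module cannot be located
missing = [name for name in required if importlib.util.find_spec(name) is None]

# Printing the interpreter path shows which environment the kernel uses
print("interpreter:", sys.executable)
print("missing packages:", missing)
```

Any package reported as missing would need to be installed into that same interpreter, for example with `%pip install numpy` inside the notebook.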